Behavioral synthesis apparatus, behavioral synthesis method, data processing system including behavioral synthesis apparatus, and non-transitory computer readable medium storing behavioral synthesis program

ABSTRACT

A behavioral synthesis apparatus includes a determination unit that determines whether or not a loop description should be converted into a pipeline, and a synthesis unit that performs behavioral synthesis while setting a stricter delay constraint for a loop description that is converted into a pipeline than a loop description that is not converted into a pipeline.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. application Ser. No. 13/922,945 filed Jun. 20, 2013, which claims priority from Japanese patent application No. 2012-141058, filed on Jun. 22, 2012, the disclosure of which is incorporated herein in its entirety by reference.

BACKGROUND

The present invention relates to a behavioral synthesis apparatus, a behavioral synthesis method, a data processing system including a behavioral synthesis apparatus, and a non-transitory computer readable medium storing a behavioral synthesis program.

The development of a behavioral synthesis apparatus that automatically generates a code of a circuit structure (structural code) such as an RTL (Register Transfer Level) code from a code of a circuit behavior (behavioral code) written in the C language or the like has been underway. In recent years, in particular, it has been desired to develop a behavioral synthesis apparatus capable of generating an RTL code with a high throughput (processing capability).

Japanese Patent No. 4770657 discloses a related art. A pipeline synthesis system disclosed in Japanese Patent No. 4770657 generates an RTL code that performs a pipeline operation from a loop description included in a behavioral code. In this way, this pipeline synthesis system generates an RTL code that reduces the number of execution cycles and thereby achieves a high throughput.

The RTL code generated by the above-described behavioral synthesis apparatus is converted into an object code through placing/routing processing and the like. Then, the converted object code is used as a circuit for an FPGA (Field Programmable Gate Array) or for a rewritable programmable device such as a dynamically-reconfigurable processor.

Japanese Patent No. 3921367 discloses a related art. A parallel arithmetic apparatus disclosed in Japanese Patent No. 3921367 changes a context (operating state) for each state based on an object code supplied from a data processing apparatus and operates a plurality of processing circuits in parallel. This parallel arithmetic apparatus can reconfigure the plurality of processing circuits according to the state (i.e., can dynamically reconfigure the plurality of processing circuits). Therefore, this parallel arithmetic apparatus can execute complex processing with a small circuit scale.

SUMMARY

The present inventors have found the following problem. When a loop description is synthesized as a pipeline circuit, if the delay is set to a small value (if the delay constraint is made stricter), a number of registers are inserted. As a result, the number of pipeline stages increases. However, since the number of states is folded by the conversion into pipelines, the number of execution cycles does not change except for the initialization (prologue) and the postprocessing (epilogue). Therefore, in pipeline circuits, the smaller the value the delay is set to (the stricter the delay constraint is made), the more the throughput (processing capability) improves.

In contrast to this, when a loop description is synthesized as a multi-state circuit without being converted into a pipeline, if the delay is set to a small value (if the delay constraint is made stricter), a number of registers are inserted. Therefore, the number of states increases. As a result, the number of execution cycles also increases. Therefore, in multi-state circuits, when the increase in the processing time due to the increase in the number of execution cycles exceeds the decrease in the processing time by the reduction in the delay, the throughput (processing capability) deteriorates. In general, in multi-state circuits, the smaller the value the delay is set to (the stricter the delay constraint is made), the larger the ratio of the total time of the setup time and the hold time of a register, a memory, or the like becomes. Therefore, the ratio of the time spent for the calculation itself decreases and thus the throughput tends to deteriorate.

Note that Japanese Patent No. 4770657 does not state in what manner the pipeline synthesis system sets the delay constraint when scheduling and allocation are performed. Therefore, it is presumed that this pipeline synthesis system performs scheduling and allocation while setting a uniform delay constraint over the entire circuit regardless of whether a loop description is synthesized as a pipeline circuit or not.

Therefore, there is a problem that when the delay is set to a small value (when the delay constraint is made stricter), this pipeline synthesis system cannot improve the throughput of a multi-state circuit, whereas when the delay is set to a large value (when the delay constraint is relaxed), the pipeline synthesis system cannot improve the throughput of a pipeline circuit. In other words, there is a problem that the related-art pipeline synthesis system cannot generate an RTL code having a high throughput.

Other problems to be solved and novel features of the present invention will be more apparent from the following descriptions of this specification and the accompanying drawings.

A first aspect of the present invention is a behavioral synthesis apparatus including: a determination unit that determines whether or not a loop description should be converted into a pipeline; and a synthesis unit that performs behavioral synthesis while setting a stricter delay constraint for a loop description that is converted into a pipeline than a loop description that is not converted into a pipeline.

Further, another aspect of the present invention is a behavioral synthesis method including performing behavioral synthesis while setting a stricter delay constraint for a loop description that is converted into a pipeline than a loop description that is not converted into a pipeline.

Further, another aspect of the present invention is a non-transitory computer readable medium storing a behavioral synthesis program that causes a computer to execute: a determination process of determining whether or not a loop description should be converted into a pipeline; and a behavioral synthesis process of performing behavioral synthesis while setting a stricter delay constraint for a loop description that is converted into a pipeline than a loop description that is not converted into a pipeline.

According to the above-described aspect of the present invention, it is possible to provide a behavioral synthesis apparatus capable of generating an RTL code having a high throughput.

BRIEF DESCRIPTION OF THE DRAWINGS

The above and other aspects, advantages and features will be more apparent from the following description of certain embodiments taken in conjunction with the accompanying drawings, in which:

FIG. 1 shows an example of a logic configuration of a data processing apparatus according to a first embodiment;

FIG. 2 is a conceptual diagram for explaining a behavioral synthesis unit according to a first embodiment;

FIG. 3A is a conceptual diagram for explaining a conversion into a pipeline;

FIG. 3B is a conceptual diagram for explaining a conversion into a pipeline;

FIG. 3C is a conceptual diagram for explaining a conversion into a pipeline;

FIG. 4 is a conceptual diagram for explaining a data hazard;

FIG. 5 is a flowchart showing an operation of a behavioral synthesis unit according to a first embodiment;

FIG. 6 is a block diagram showing a hardware configuration of a data processing apparatus according to a first embodiment;

FIG. 7 is a block diagram showing a configuration example of an array-type processor according to a second embodiment;

FIG. 8 shows a configuration example of a processor element and a switch element according to a second embodiment;

FIG. 9 is a block diagram showing a configuration example of a data processing system according to a second embodiment;

FIG. 10A shows a connection relation between arithmetic units and registers;

FIG. 10B shows a connection relation between arithmetic units and registers;

FIG. 11A is a block diagram showing a configuration example of an arithmetic unit;

FIG. 11B is a block diagram showing a configuration example of an arithmetic unit according to a third embodiment;

FIG. 11C is a block diagram showing a configuration example of an arithmetic unit according to a third embodiment;

FIG. 11D is a block diagram showing a configuration example of an arithmetic unit according to a third embodiment;

FIG. 12A is a block diagram showing a configuration example of a memory unit;

FIG. 12B is a block diagram showing a configuration example of a memory unit according to a third embodiment;

FIG. 12C is a block diagram showing a configuration example of a memory unit according to a third embodiment;

FIG. 13A shows a configuration example of a register unit according to a third embodiment;

FIG. 13B is a block diagram showing a part of an array-type processor according to a third embodiment;

FIG. 14 is a flowchart showing an operation of a behavioral synthesis unit according to a third embodiment;

FIG. 15 is a flowchart showing an operation of a behavioral synthesis unit according to a third embodiment;

FIG. 16A shows a source code of a loop counter circuit;

FIG. 16B is a block diagram showing a logic configuration of a loop counter circuit; and

FIG. 17 shows a placement example of a loop counter circuit.

DETAILED DESCRIPTION

Embodiments according to the present invention are explained hereinafter with reference to the drawings. It should be noted that the drawings are made in a simplified manner, and therefore the technical scope of the present invention should not be narrowly interpreted based on these drawings. Further, the same components are assigned with the same symbols and their duplicated explanation is omitted.

In the following embodiments, when necessary, the present invention is explained by using separate sections or separate embodiments. However, those embodiments are not unrelated with each other, unless otherwise specified. That is, they are related in such a manner that one embodiment is a modified example, an application example, a detailed example, or a supplementary example of a part or the whole of another embodiment. Further, in the following embodiments, when the number of elements or the like (including numbers, values, quantities, ranges, and the like) is mentioned, the number is not limited to that specific number except for cases where the number is explicitly specified or the number is obviously limited to a specific number based on its principle. That is, a larger number or a smaller number than the specific number may be also used.

Further, in the following embodiments, their components (including operation steps and the like) are not necessarily indispensable except for cases where the component is explicitly specified or the component is obviously indispensable based on its principle. Similarly, in the following embodiments, when a shape, a position relation, or the like of a component(s) or the like is mentioned, shapes or the likes that are substantially similar to or resemble that shape are also included in that shape except for cases where it is explicitly specified or they are eliminated based on its principle. This is also true for the above-described number or the like (including numbers, values, quantities, ranges, and the like).

First Embodiment

FIG. 1 is a block diagram showing an example of a logic configuration of a data processing apparatus 10 including a behavioral synthesis unit (behavioral synthesis apparatus) according to a first embodiment of the present invention. The behavioral synthesis unit according to this embodiment performs behavioral synthesis while setting a shorter delay (stricter delay constraint; stricter timing constraint; higher clock frequency) for a loop description that is converted into a pipeline than a loop description that is not converted into a pipeline. In this way, the behavioral synthesis unit according to this embodiment can generate an RTL code having a higher throughput (processing capability) than that of the related art. The behavioral synthesis unit according to this embodiment is explained hereinafter in a more specific manner.

The data processing apparatus 10 shown in FIG. 1 includes a behavioral synthesis unit (behavioral synthesis apparatus) 100 and an object code generation unit (layout unit) 109. The behavioral synthesis unit 100 includes a DFG generation unit 101, a scheduling unit 102, an allocation unit 103, an FSM generation unit 104, a data path generation unit 105, a pipeline structure generation unit 106, an RTL code generation unit 107, and a pipeline determination unit 108. Note that among the components of the behavioral synthesis unit 100, the components other than the pipeline determination unit 108 are collectively called “synthesis unit”.

As also shown in a conceptual diagram in FIG. 2, the behavioral synthesis unit 100 generates a finite state machine (FSM) and a plurality of data paths each of which corresponds to a respective one of a plurality of states in the finite state machine from a code (behavioral code: hereinafter called “source code”) 11 of a circuit behavior such as a C-language, and outputs the generated finite state machine and the data paths as a code (structural code: hereinafter called “RTL code”) 14 of a circuit structure.

The DFG generation unit 101 performs syntactic analysis of the source code 11 and thereby creates a DFG (Data Flow Graph) including nodes representing various processing functions such as calculation and branches representing data flows.
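As an illustration only, the following minimal sketch builds a small data flow graph from a single arithmetic expression. It is not the DFG generation unit 101 itself: it uses Python's standard ast module on a Python expression rather than a C-language source code, and the function name build_dfg and the node numbering are assumptions introduced here.

```python
import ast

def build_dfg(expr):
    """Return (nodes, edges, root) for one arithmetic expression.

    nodes: list of labels, indexed by node id (hypothetical layout).
    edges: list of (producer id, consumer id) pairs, i.e. data flows.
    root:  id of the node that produces the final result.
    """
    nodes = []
    edges = []

    def visit(n):
        if isinstance(n, ast.BinOp):
            left = visit(n.left)
            right = visit(n.right)
            nodes.append(type(n.op).__name__)  # e.g. "Add", "Sub", "Mult"
            me = len(nodes) - 1
            edges.append((left, me))           # branch: left operand -> operation
            edges.append((right, me))          # branch: right operand -> operation
            return me
        if isinstance(n, ast.Name):
            nodes.append(n.id)                 # input variable
            return len(nodes) - 1
        if isinstance(n, ast.Constant):
            nodes.append(repr(n.value))        # literal operand
            return len(nodes) - 1
        raise ValueError("syntax not supported by this sketch")

    root = visit(ast.parse(expr, mode="eval").body)
    return nodes, edges, root

nodes, edges, root = build_dfg("(a + b) * (c - d)")
print(nodes)  # operation and operand nodes
print(edges)  # data-flow branches between them
```

In this sketch the operation nodes correspond to the processing functions and the edge list corresponds to the branches representing data flows.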

The pipeline determination unit 108 determines, for each loop description included in the source code 11, whether or not the loop description should be converted into a pipeline. In this embodiment, the pipeline determination unit 108 determines a loop description(s) specified by a user as a loop description(s) to be converted into a pipeline(s). Note that the pipeline determination unit 108 may automatically determine, for each loop description, whether or not the loop description should be converted into a pipeline.

A conversion of a loop description into a pipeline is briefly explained hereinafter with reference to FIGS. 3A to 3C. FIG. 3A is a conceptual diagram showing a process in which a loop description (the number of states=4) is not converted into a pipeline. FIG. 3B is a conceptual diagram showing a process in which four states (control steps) of a loop description are folded into two states and thereby converted into pipelines. FIG. 3C is a conceptual diagram showing a process in which four states of a loop description are folded into one state and thereby converted into pipelines. Note that this example is explained on the assumption that the number of pipeline stages is four and the number of times of the loop is ten. Further, assume that in this example, one execution cycle (clock cycle) is required to execute one stage (one set of processes).

As shown in FIG. 3A, when a loop description (the number of states=4) is not converted into a pipeline, firstly, four stages A1, B1, C1 and D1, which are the first loop processing, are successively executed. After that, four stages A2, B2, C2 and D2, which are the second loop processing, are successively executed. The processing like this is repeated until the tenth loop processing is executed. As a result, the number of necessary execution cycles for executing the loop processing is 40 execution cycles.

As shown in FIG. 3B, when the four states of a loop description are folded into two states and thereby converted into pipelines, firstly, four stages A1, B1, C1 and D1, which are the first loop processing, are successively executed. Further, two steps (two execution cycles) after the start of the first loop processing, four stages A2, B2, C2 and D2, which are the second loop processing, are successively executed. Similarly, four stages of each of the third to the tenth loop processing are successively executed two steps (two execution cycles) after the start of the immediately-preceding loop processing. As a result, for example, the two stages C1 and A2 are executed in parallel and the two stages D1 and B2 are executed in parallel. Further, for example, the two stages C2 and A3 are executed in parallel and the two stages D2 and B3 are executed in parallel. As a result, the number of necessary execution cycles for executing the loop processing is equal to a number that is obtained by adding the number of execution cycles for the initialization (prologue) and the postprocessing (epilogue) to 18 execution cycles.

As shown in FIG. 3C, when four states of a loop description are folded into one state and thereby converted into pipelines, firstly, four stages A1, B1, C1 and D1, which are the first loop processing, are successively executed. Further, one step (one execution cycle) after the start of the first loop processing, four stages A2, B2, C2 and D2, which are the second loop processing, are successively executed. Similarly, four stages of each of the third to the tenth loop processing are successively executed one step (one execution cycle) after the start of the immediately-preceding loop processing. As a result, for example, the four stages D1, C2, B3 and A4 are executed in parallel and the four stages D2, C3, B4 and A5 are executed in parallel. As a result, the number of necessary execution cycles for executing the loop processing is equal to a number that is obtained by adding the number of execution cycles for the initialization (prologue) and the postprocessing (epilogue) to 7 execution cycles. Note that when the number of states of a loop description is folded into one state, if there is no description other than the loop description, no finite state machine is generated except for the initialization and the postprocessing.

As shown above, when a loop description is converted into a pipeline(s), the number of execution cycles is reduced in comparison to when a loop description is not converted into a pipeline(s). Therefore, when behavioral synthesis is performed while setting a short delay (strict delay constraint) for a loop description(s) to be converted into a pipeline(s), the increase in the number of execution cycles is reduced and the processing time per step is also reduced owing to the conversion into the pipeline(s), though the number of pipeline stages increases. As a result, the throughput improves.
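The cycle counts in the examples of FIGS. 3A to 3C can be reproduced with a short calculation. The following is a sketch under stated assumptions: one execution cycle per stage, a new loop iteration started every "interval" cycles (the number of folded states), and an interval that divides the number of stages evenly. The function names are hypothetical, and the split into a steady-state part and a prologue/epilogue part is one reading of the figures rather than a definition taken from them.

```python
def sequential_cycles(stages, iterations):
    """Cycles when the loop is not pipelined (FIG. 3A style)."""
    return stages * iterations

def pipelined_cycles(stages, iterations, interval):
    """Total cycles, prologue and epilogue included, when a new
    iteration starts every `interval` cycles."""
    return interval * (iterations - 1) + stages

def steady_state_cycles(stages, iterations, interval):
    """Cycles during which the maximum number of iterations overlap.
    Assumes `interval` divides `stages` evenly."""
    overlap = stages // interval            # iterations in flight at once
    return interval * (iterations - overlap + 1)

no_pipeline = sequential_cycles(4, 10)       # FIG. 3A
fold_to_two = pipelined_cycles(4, 10, 2)     # FIG. 3B, total
fold_to_one = pipelined_cycles(4, 10, 1)     # FIG. 3C, total
steady_two = steady_state_cycles(4, 10, 2)   # FIG. 3B, without prologue/epilogue
steady_one = steady_state_cycles(4, 10, 1)   # FIG. 3C, without prologue/epilogue
print(no_pipeline, fold_to_two, fold_to_one, steady_two, steady_one)
```

Under these assumptions the steady-state values agree with the 18 and 7 execution cycles mentioned above, and the difference between the total and the steady-state value is the prologue and epilogue.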

Note that details of the conversion of a loop description into a pipeline are also disclosed in “Takao Toi, Noritsugu Nakamura, Yoshinosuke Kato, Toru Awashima, Kazutoshi Wakabayashi, “High-level Synthesis Challenges for Mapping a Complete Program on a Dynamically Reconfigurable Processor”, IPSJ Transaction on System LSI Design Methodology, February, 2010, vol. 3, pp 91-104”, which was published by the inventors of the present application.

However, when a loop description is converted into a pipeline, there is a possibility that a data hazard occurs. Therefore, it is necessary to avoid the occurrence of a data hazard. A data hazard is briefly explained hereinafter with reference to FIG. 4. This example is explained by using the same conditions as those in FIG. 3C.

Firstly, four stages A1 (Read), B1 (Read), C1 (Write) and D1 (Read), which are the first loop processing, are successively executed. Further, one step (one execution cycle) after the start of the first loop processing, four stages A2 (Read), B2 (Read), C2 (Write) and D2 (Read), which are the second loop processing, are successively executed. Note that since the data read process at the stage A2 is performed prior to the data write process at the stage C1, there is a possibility that unintended data is read. The problem like this is called “data hazard”.

In order to avoid this data hazard, forwarding (bypassing) processing is carried out in the scheduling of the behavioral synthesis so that the data read process at the stage A2 is prevented from being performed prior to the data write process at the stage C1. Note that details of the forwarding are also disclosed in “Computer Organization and Design” written by David A. Patterson and John L. Hennessy, Nikkei Business and Publications, Inc.
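The hazard condition described with reference to FIG. 4 can be checked mechanically. The following sketch only detects the condition; it does not implement forwarding. It assumes that every read and write touches the same storage location, that stage s of iteration k executes at cycle k × interval + s, and that a read occurring in the same cycle as the write is not flagged. The function name find_hazards is hypothetical.

```python
def find_hazards(ops, interval):
    """Return (read_stage, write_stage) index pairs for which the next
    iteration's read happens strictly before the current iteration's write.

    ops:      per-stage operation, "R" (read) or "W" (write).
    interval: cycles between the starts of consecutive iterations.
    """
    hazards = []
    for w, write_op in enumerate(ops):
        if write_op != "W":
            continue
        for r, read_op in enumerate(ops):
            # Iteration k writes at cycle w (relative to its own start);
            # iteration k+1 reads at cycle interval + r on the same time base.
            if read_op == "R" and interval + r < w:
                hazards.append((r, w))
    return hazards

ops = ["R", "R", "W", "R"]               # stages A, B, C, D as in FIG. 4
folded_to_one = find_hazards(ops, 1)     # A2 reads before C1 writes
not_pipelined = find_hazards(ops, 4)     # iterations do not overlap
print(folded_to_one, not_pipelined)
```

With the four stages folded into one state, the sketch reports the pair (stage A of the next iteration, stage C of the current iteration), which is the case the scheduling has to resolve.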

Referring to FIG. 1 again, the scheduling unit 102 determines, for each of a plurality of nodes in the DFG, when the node should be executed based on a synthesis constraint 12 and circuit information 13 (scheduling), and outputs the determination results as a CDFG (Control Data Flow Graph). The allocation unit 103 determines a register and a memory unit that are used to temporarily store data represented by a branch in the CDFG based on the synthesis constraint 12 and the circuit information 13, and also determines which arithmetic unit should be used for an operation represented by a node in the CDFG.
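How a delay constraint influences the number of control steps can be illustrated with a simple as-soon-as-possible scheduler that chains operations within one step as long as their accumulated delay fits the constraint. This is a sketch, not the algorithm of the scheduling unit 102: resource limits and setup/hold times are ignored, and the node names, delay values, and the function name schedule are assumptions made for the example.

```python
def schedule(nodes, deps, delay, clock):
    """Assign a control step to each node.

    nodes: node names in topological order.
    deps:  node name -> list of predecessor names.
    delay: node name -> combinational delay (same unit as `clock`).
    clock: delay constraint per control step.
    """
    step = {}     # control step chosen for each node
    finish = {}   # time within that step at which the result is ready
    for n in nodes:
        if delay[n] > clock:
            raise ValueError(f"{n} cannot fit in one step")
        s, start = 0, 0
        for p in deps.get(n, []):
            if step[p] > s:
                s, start = step[p], finish[p]
            elif step[p] == s:
                start = max(start, finish[p])
        if start + delay[n] > clock:
            # Chaining would violate the constraint, so the value is
            # registered and the node moves to the next control step.
            s, start = s + 1, 0
        step[n] = s
        finish[n] = start + delay[n]
    return step

nodes = ["mul", "add", "sub"]
deps = {"add": ["mul"], "sub": ["add"]}
delay = {"mul": 4, "add": 2, "sub": 2}   # hypothetical delays

relaxed = schedule(nodes, deps, delay, clock=10)
strict = schedule(nodes, deps, delay, clock=5)
print(max(relaxed.values()) + 1, max(strict.values()) + 1)  # number of steps
```

With the relaxed constraint all three operations chain into a single step, whereas the stricter constraint forces a register after the multiplication and yields two steps.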

Note that in the synthesis constraint 12, information such as a circuit scale, an amount of resources, a delay constraint (timing constraint; clock frequency), and a loop description to be converted into a pipeline is defined. Further, in the synthesis constraint 12, a delay constraint for a multi-state circuit and a delay constraint for a pipeline circuit are defined as delay constraints. The delay constraint for a pipeline circuit is stricter than the delay constraint for a multi-state circuit. Further, in the circuit information 13, for example, information such as the scale and the delay of each resource (arithmetic unit 212, register 213, memory unit 210, and the like) provided in an array-type processor 20 (which is described later) is defined.

Note that when a loop description is synthesized as a pipeline circuit, if the delay is set to a small value (if the delay constraint is made stricter), a number of registers are inserted. As a result, the number of pipeline stages increases. However, since the number of states is folded by the conversion into pipelines, the number of execution cycles does not change except for the initialization (prologue) and the postprocessing (epilogue). Therefore, in pipeline circuits, the smaller the value the delay is set to (the stricter the delay constraint is made), the more the throughput (processing capability) improves.

In contrast to this, when a loop description is synthesized as a multi-state circuit without being converted into a pipeline, if the delay is set to a small value (if the delay constraint is made stricter), a number of registers are inserted. Therefore, the number of states increases. As a result, the number of execution cycles also increases. Therefore, in multi-state circuits, when the increase in the processing time due to the increase in the number of execution cycles exceeds the decrease in the processing time by the reduction in the delay, the throughput (processing capability) deteriorates. In general, in multi-state circuits, the smaller the value the delay is set to (the stricter the delay constraint is made), the larger the ratio of the total time of the setup time and the hold time of a register, a memory, or the like becomes. Therefore, the ratio of the time spent for the calculation itself decreases and thus the throughput tends to deteriorate.

Therefore, the scheduling unit 102 and the allocation unit 103 perform scheduling and allocation, respectively, by setting the delay constraint for a pipeline circuit for a loop description(s) that is converted into a pipeline(s) and setting the delay constraint for a multi-state circuit for the other description(s). In other words, the scheduling unit 102 and the allocation unit 103 perform scheduling and allocation, respectively, by setting a shorter delay (stricter delay constraint) for a loop description(s) that is converted into a pipeline(s) than a delay for the other description(s).

As a result, although the number of pipeline stages increases and thus the latency increases in the pipeline circuit, the increase in the number of execution cycles is reduced and the processing time per step is also reduced owing to the conversion into the pipelines. Therefore, the throughput improves in comparison to the case where the delay is set to a large value. Further, the number of states is reduced and thus the number of execution cycles is reduced in the multi-state circuit other than the pipeline circuit. In addition, the total time of the setup time and the hold time of a register, a memory, or the like is also reduced. Therefore, the throughput improves in comparison to the case where the delay is set to a small value. That is, the overall throughput of the circuit improves in comparison to the related art.
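The opposite trends for the two circuit types can be seen in a toy timing model. The model below is an assumption made for illustration and does not appear in the embodiment: each step loses a fixed register overhead (setup time plus hold time), the remaining part of the clock period is available for calculation, and the pipeline is folded into one state as in FIG. 3C. All numeric values and function names are hypothetical.

```python
def num_steps(logic_delay, clock, overhead):
    """Steps (states or pipeline stages) needed for one loop iteration."""
    usable = clock - overhead            # time left for calculation per step
    if usable <= 0:
        raise ValueError("clock period is consumed entirely by overhead")
    return -(-logic_delay // usable)     # ceiling division

def multi_state_time(logic_delay, clock, overhead, iterations):
    """Processing time when the loop is not pipelined."""
    return iterations * num_steps(logic_delay, clock, overhead) * clock

def pipeline_time(logic_delay, clock, overhead, iterations):
    """Processing time when a new iteration starts every cycle."""
    stages = num_steps(logic_delay, clock, overhead)
    return (iterations - 1 + stages) * clock

LOGIC_DELAY = 12   # total calculation delay of one iteration (hypothetical)
OVERHEAD = 1       # setup + hold time per step (hypothetical)
ITERATIONS = 100

results = {}
for clock in (13, 4, 2):                 # relaxed -> strict delay constraint
    results[clock] = (
        multi_state_time(LOGIC_DELAY, clock, OVERHEAD, ITERATIONS),
        pipeline_time(LOGIC_DELAY, clock, OVERHEAD, ITERATIONS),
    )
    print(clock, results[clock])
```

In this model, tightening the constraint lengthens the multi-state processing time while shortening the pipelined one, which is why a single uniform constraint cannot suit both.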

Next, the FSM generation unit 104 generates a finite state machine (FSM) based on the results of the scheduling unit 102 and the allocation unit 103. Further, the data path generation unit 105 generates a plurality of data paths each of which corresponds to a respective one of a plurality of states included in the finite state machine based on the results of the scheduling unit 102 and the allocation unit 103. Further, the pipeline structure generation unit 106 folds a plurality of states included in a loop description that should be converted into a pipeline and thereby converts the loop description into a pipeline(s).

The RTL code generation unit 107 outputs the above-described finite state machine and the plurality of data paths corresponding to the respective states included in that finite state machine as an RTL code 14.

After that, the object code generation unit 109 reads the RTL code 14, generates a netlist by performing technology mapping, placing/routing, and the like, converts the netlist into a binary code, and outputs the binary code as an object code 15.

As described above, the behavioral synthesis unit 100 according to this embodiment of the present invention performs behavioral synthesis while setting a shorter delay (stricter delay constraint) for a loop description that is converted into a pipeline than a loop description that is not converted into a pipeline. As a result, the behavioral synthesis unit 100 according to this embodiment can generate an RTL code having a higher throughput (processing capability) than that of the related art.

[Flowchart]

Next, an operation of the behavioral synthesis unit 100 in the data processing apparatus 10 is explained with reference to FIG. 5. FIG. 5 is a flowchart showing an operation of the behavioral synthesis unit 100.

Firstly, after the behavioral synthesis unit 100 receives a source code 11 and performs syntactic analysis (S101), the behavioral synthesis unit 100 performs optimization at the behavioral code language level (S102), assigns nodes representing various processing functions and branches representing data flows (S103), and thereby creates a DFG (S104).

Next, the behavioral synthesis unit 100 determines, for each loop description included in the source code 11, whether or not the loop description should be converted into a pipeline (S105) and then performs scheduling (S106) and allocation (S107) according to a synthesis constraint 12 and circuit information 13.

Note that the behavioral synthesis unit 100 performs scheduling and allocation while setting a delay constraint for a pipeline circuit for a loop description(s) that is converted into a pipeline(s) and setting a delay constraint for a multi-state circuit for the other description(s). In other words, the behavioral synthesis unit 100 performs scheduling and allocation while setting a shorter delay (stricter delay constraint) for a loop description(s) that is converted into a pipeline(s) than a delay for the other description(s). As a result, although the number of pipeline stages increases and thus the latency increases in the pipeline circuit, the increase in the number of execution cycles is reduced and the processing time per step is also reduced owing to the conversion into the pipelines. Therefore, the throughput improves in comparison to the case where the delay is set to a large value. Further, the number of states is reduced and thus the number of execution cycles is reduced in the multi-state circuit other than the pipeline circuit. In addition, the total time of the setup time and the hold time of a register, a memory, or the like is also reduced. Therefore, the throughput improves in comparison to the case where the delay is set to a small value. That is, the overall throughput of the circuit improves in comparison to the related art.

Next, the behavioral synthesis unit 100 generates a finite state machine and a plurality of data paths each of which corresponds to a respective one of a plurality of states included in that finite state machine based on the results of the scheduling and the allocation (S108 and S109). Further, the behavioral synthesis unit 100 folds a plurality of states included in a loop description to be converted into a pipeline(s) and thereby converts the loop description into a pipeline(s) (S110). After that, the behavioral synthesis unit 100 optimizes the RTL level and/or the logic level for the finite state machine and the plurality of data paths (S111) and then outputs the optimized finite state machine and the data paths as an RTL code 14 (S112).

As described above, the behavioral synthesis unit 100 according to this embodiment of the present invention performs behavioral synthesis while setting a shorter delay (stricter delay constraint) for a loop description that is converted into a pipeline than a loop description that is not converted into a pipeline. As a result, the behavioral synthesis unit 100 according to this embodiment can generate an RTL code having a higher throughput (processing capability) than that of the related art.

[Hardware Configuration Example of Data Processing Apparatus 10]

Note that the behavioral synthesis unit 100 and the data processing apparatus 10 including the same according to this embodiment of the present invention can be implemented, for example, by a general-purpose computer system. A hardware configuration example is briefly explained hereinafter with reference to FIG. 6.

FIG. 6 is a block diagram showing an example of a hardware configuration of the data processing apparatus 10 according to this embodiment of the present invention. A computer 110 includes, for example, a CPU (Central Processing Unit) 111 as a control device, a RAM (Random Access Memory) 112, a ROM (Read Only Memory) 113, an IF (Interface) 114 as an external interface, and an HDD (Hard Disk Drive) 115 as an example of a nonvolatile storage device. The computer 110 may include, as other components that are not illustrated in the figure, an input device such as a keyboard and a mouse, and a display device such as a display.

In the HDD 115, an OS (Operating System) (not shown), behavioral code information 116, structural code information 117, and a behavioral synthesis program 118 are stored. The behavioral code information 116 is information about the behavior of a circuit and corresponds to the source code (behavioral code) 11 in FIG. 1. The structural code information 117 is information about the structure of a circuit and corresponds to the RTL code 14 in FIG. 1. The behavioral synthesis program 118 is a computer program in which behavioral synthesis processing according to this embodiment is incorporated.

The CPU 111 controls various processes performed in the computer 110, access to the RAM 112, the ROM 113, the IF 114 and the HDD 115, and so on. In the computer 110, the CPU 111 reads and executes the OS and the behavioral synthesis program 118 stored in the HDD 115. In this way, the computer 110 implements the behavioral synthesis unit 100 and the data processing apparatus 10 including the same according to this embodiment of the present invention.

Second Embodiment

In this embodiment according to the present invention, a specific example of a circuit to which an output result (object code 15) of the data processing apparatus 10 is applied is explained.

FIG. 7 is a block diagram showing a configuration example of an array-type processor (parallel arithmetic apparatus) 20 that dynamically changes the circuit configuration for each state based on an object code 15. The array-type processor 20 shown in FIG. 7 includes an I/F unit 201, a code memory 202, a state transition controlling unit 203, a matrix circuit unit 205, and a data memory unit 206. In the matrix circuit unit 205, a plurality of processor elements (PEs) 207 are arranged in a matrix and a plurality of switch elements (SWEs) 208 are also arranged in a matrix. The data memory unit 206 includes a plurality of memory units 210. For example, the plurality of memory units 210 are arranged so as to surround the matrix circuit unit 205.

Note that the object code 15 includes a plurality of contexts (corresponding to a plurality of data paths) and a state transition condition(s) (corresponding to a finite state machine). In each context, an operation instruction for each of the plurality of processor elements 207 and the plurality of switch elements 208 is defined. Further, in the state transition condition, an operation instruction for the state transition controlling unit 203 that selects one of the plurality of contexts according to the state is defined.

The object code 15 is supplied from the data processing apparatus 10 to the I/F unit 201. The code memory 202 is composed of an information storage medium such as a RAM and stores the object code 15 supplied to the I/F unit 201.

The state transition controlling unit 203 selects one of the plurality of contexts according to the state and outputs a plurality of instruction pointers (IPs) to respective processor elements 207 according to the selected context.

FIG. 8 shows a configuration example of a pair of a processor element 207 and a switch element 208. The processor element 207 includes an instruction memory 211, an arithmetic unit 212, and a register 213. The switch element 208 includes line connection switches 214 to 218. Note that this embodiment is explained by using an example case where the arithmetic unit 212 includes only one arithmetic element (ALU). Further, each element in the processor element 207 exchanges data through a data line and exchanges a flag through a flag line (the illustration of these lines is omitted in the figure).

The processor element 207 performs arithmetic processing on data that is supplied from another processor element 207 through a data line, and outputs a calculation result (data) to another processor element 207 through a data line. Further, the processor element 207 receives a flag from another processor element 207 through a flag line and outputs a flag to another processor element 207 through a flag line. For example, the processor element 207 determines whether or not to start arithmetic processing based on a flag supplied from another processor element 207 and outputs a flag that is determined according to the arithmetic processing result to another processor element 207.

The instruction memory 211 stores a plurality of operation instructions for the processor elements 207 and the switch elements 208 according to the number of the contexts. Further, one of the plurality of operation instructions is read from the instruction memory 211 based on an instruction pointer (IP) supplied from the state transition controlling unit 203. The processor element 207 and the switch element 208 perform an operation according to the operation instruction read from the instruction memory 211.
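The context selection described above can be illustrated with a minimal sketch. The class names, the state labels "S0" and "S1", and the instruction strings are hypothetical and chosen only for illustration; the sketch merely assumes that a state selects a context, that a context supplies one instruction pointer per processor element, and that each pointer indexes that element's instruction memory.

```python
class ProcessorElement:
    """Hypothetical model of a processor element 207 with its instruction memory 211."""

    def __init__(self, instruction_memory):
        # One operation instruction per context.
        self.instruction_memory = instruction_memory

    def fetch(self, instruction_pointer):
        # Read the operation instruction selected by the instruction pointer (IP).
        return self.instruction_memory[instruction_pointer]


class StateTransitionController:
    """Hypothetical model of the state transition controlling unit 203."""

    def __init__(self, context_for_state, pointers_for_context):
        self.context_for_state = context_for_state
        self.pointers_for_context = pointers_for_context

    def instruction_pointers(self, state):
        # Select one context according to the state, then return one IP per element.
        context = self.context_for_state[state]
        return self.pointers_for_context[context]


def dispatch(controller, elements, state):
    """Return the operation instruction each element executes in the given state."""
    pointers = controller.instruction_pointers(state)
    return [pe.fetch(ip) for pe, ip in zip(elements, pointers)]


# Illustrative data only: two elements, two contexts.
elements = [ProcessorElement(["add", "mul"]), ProcessorElement(["pass", "sub"])]
controller = StateTransitionController(
    context_for_state={"S0": 0, "S1": 1},
    pointers_for_context={0: [0, 0], 1: [1, 1]},
)
in_s0 = dispatch(controller, elements, "S0")
in_s1 = dispatch(controller, elements, "S1")
```

In this sketch, changing the state changes the operation of every element at once, which corresponds to the circuit configuration being changed for each state.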

The arithmetic unit 212 carries out arithmetic processing on input data in accordance with an arithmetic processing content that is determined according to the operation instruction read from the instruction memory 211.

The register 213 temporarily stores data to be input to the arithmetic unit 212, a calculation result by the arithmetic unit 212, intermediate data of arithmetic processing performed by the arithmetic unit 212, and the like. Note that a calculation result of the arithmetic unit 212 may be directly output to the outside of the processor element 207 without being temporarily stored in the register 213.

The line connection switches 214 to 216 connect, according to an operation instruction read from the instruction memory 211, the corresponding processor element 207 (i.e., the processor element 207 including the instruction memory 211 storing that operation instruction) with another processor element 207 (e.g., an adjacent processor element 207) through a data line(s).

The line connection switches 216 to 218 connect, according to an operation instruction read from the instruction memory 211, the corresponding processor element 207 (i.e., the processor element 207 including the instruction memory 211 storing that operation instruction) with another processor element 207 (e.g., an adjacent processor element 207) through a flag line(s).

Note that the line connection switches 214 to 216 connect a line(s) according to an operation instruction read from the instruction memory 211. Further, the line connection switch 216 is disposed at an intersection of a data line(s) and/or a flag line(s).

[Data Processing System 1]

FIG. 9 is a block diagram showing a configuration example of a data processing system 1 including a data processing apparatus 10 and an array-type processor 20.

In the data processing system 1 shown in FIG. 9, the data processing apparatus 10 reads a source code 11, a synthesis constraint 12, and circuit information 13 and thereby generates an object code 15. The array-type processor 20 carries out arithmetic processing on externally-supplied processing data while dynamically changing the circuit configuration for each state based on the object code 15 output from the data processing apparatus 10, and outputs the resultant processing data as result data.

[Details of Reconfiguration of Array-Type Processor 20]

Next, details of reconfiguration of the array-type processor 20 according to a delay constraint at the time of behavioral synthesis are explained with reference to FIGS. 10A and 10B. FIG. 10A shows a connection relation between arithmetic units 212 and registers 213 in a case where a delay constraint is not strict (for example, the delay constraint is 12 ns). FIG. 10B shows a connection relation between arithmetic units 212 and registers 213 in a case where a delay constraint is strict (for example, the delay constraint is 7 ns). Note that for the sake of simpler explanation, this example is explained on the assumption that: the delay of registers 213 is uniformly 0 ns; the delay of arithmetic units 212 is uniformly 3 ns; and the wiring delay is uniformly 2 ns. Further, the setup time and the hold time are not taken into consideration.

Firstly, in the example shown in FIG. 10A, since the delay constraint is not strict (i.e., 12 ns), two arithmetic units 212 are connected between registers. As a result, the period of the execution cycle becomes longer. However, the number of states is reduced and the number of execution cycles is thereby reduced. The behavioral synthesis for a multi-state circuit is performed while setting a lax delay constraint like this.

In contrast to this, in the example shown in FIG. 10B, since the delay constraint is strict (i.e., 7 ns), only one arithmetic unit 212 is connected between registers. That is, in the example shown in FIG. 10B, another register 213 is inserted between the two arithmetic units 212 in comparison to the example shown in FIG. 10A. As a result, although the number of states increases and thus the number of execution cycles increases, the period of the execution cycle becomes shorter. Note that in pipeline circuits, the number of states is folded and thus the increase in the number of execution cycles is reduced. Therefore, it is possible to achieve a high-speed operation by reducing the delay and thereby shortening the period of the execution cycle. The behavioral synthesis for a pipeline circuit is performed while setting a strict delay constraint like this.
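The arithmetic behind the 12 ns and 7 ns examples can be checked with a short sketch. It uses the delays stated above (register 0 ns, arithmetic unit 3 ns, wiring 2 ns) and assumes, as a reading consistent with both figures, that one wiring segment lies between each pair of adjacent elements on the register-to-register path. The function names are hypothetical.

```python
REGISTER_DELAY_NS = 0    # delay of a register 213 (stated assumption of the example)
ARITHMETIC_DELAY_NS = 3  # delay of an arithmetic unit 212
WIRING_DELAY_NS = 2      # delay of one wiring segment


def path_delay_ns(arithmetic_units):
    """Delay of register -> wire -> unit -> wire -> ... -> unit -> wire -> register.

    Assumes one wiring segment between each pair of adjacent elements, so a path
    with n arithmetic units has n + 1 wiring segments.
    """
    wiring_segments = arithmetic_units + 1
    return (REGISTER_DELAY_NS
            + arithmetic_units * ARITHMETIC_DELAY_NS
            + wiring_segments * WIRING_DELAY_NS)


def max_units_between_registers(delay_constraint_ns):
    """Largest number of chained arithmetic units that still meets the constraint."""
    units = 0
    while path_delay_ns(units + 1) <= delay_constraint_ns:
        units += 1
    return units


lax_units = max_units_between_registers(12)     # case of FIG. 10A
strict_units = max_units_between_registers(7)   # case of FIG. 10B
```

Under these assumptions, two chained arithmetic units give 2 x 3 ns + 3 x 2 ns = 12 ns and a single unit gives 3 ns + 2 x 2 ns = 7 ns, which matches the two delay constraints used in the example.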

Third Embodiment

In this embodiment according to the present invention, a modified example of the array-type processor 20 is explained.

[Modified Example of Arithmetic Unit 212]

Firstly, a modified example of the arithmetic unit 212 provided in the array-type processor 20 is explained with reference to FIGS. 11A to 11D. FIG. 11A is a block diagram showing the arithmetic unit 212. Further, FIGS. 11B to 11D are block diagrams showing modified examples of the arithmetic unit 212 as arithmetic units 212 b to 212 d.

An arithmetic unit 212 shown in FIG. 11A includes an arithmetic element (ALU) 2121. The arithmetic element 2121 performs arithmetic processing on input data of the arithmetic unit 212 and outputs a calculation result.

An arithmetic unit 212 b shown in FIG. 11B includes a flip-flop in front of an arithmetic element. Specifically, the arithmetic unit 212 b includes an arithmetic element 2121 b, a flip-flop (F/F) 2122 b, and a selector (SEL) 2123 b. The flip-flop 2122 b takes in input data of the arithmetic unit 212 b in synchronization with a clock and outputs the taken data. The selector 2123 b selectively outputs either the output of the flip-flop 2122 b or the input data of the arithmetic unit 212 b according to the state (that is, according to the operation instruction read from an instruction memory). The arithmetic element 2121 b performs arithmetic processing on the output of the selector 2123 b and outputs a calculation result.

An arithmetic unit 212 c shown in FIG. 11C includes a flip-flop behind an arithmetic element. Specifically, the arithmetic unit 212 c includes an arithmetic element 2121 c, a flip-flop 2122 c, and a selector 2123 c. The arithmetic element 2121 c performs arithmetic processing on input data of the arithmetic unit 212 c and outputs a calculation result. The flip-flop 2122 c takes in the calculation result of the arithmetic element 2121 c in synchronization with a clock and outputs the taken calculation result. The selector 2123 c selectively outputs either the output of the flip-flop 2122 c or the calculation result of the arithmetic element 2121 c according to the state.

An arithmetic unit 212 d shown in FIG. 11D includes a flip-flop between two divided arithmetic elements. Specifically, the arithmetic unit 212 d includes two divided arithmetic elements (first arithmetic element) 2121 d and (second arithmetic element) 2124 d, a flip-flop 2122 d, and a selector 2123 d. The arithmetic element 2121 d performs arithmetic processing on input data of the arithmetic unit 212 d and outputs a calculation result (intermediate data). The flip-flop 2122 d takes in the calculation result of the arithmetic element 2121 d in synchronization with a clock and outputs the taken calculation result. The selector 2123 d selectively outputs either the output of the flip-flop 2122 d or the calculation result of the arithmetic element 2121 d according to the state. The arithmetic element 2124 d performs arithmetic processing on the output of the selector 2123 d and outputs a calculation result.

Note that the array-type processor 20 according to this embodiment includes one of the arithmetic units 212 b to 212 d as a substitute for each of part or all of the plurality of arithmetic units 212. As a result, the array-type processor 20 according to this embodiment can not only insert a register 213 between arithmetic units, but also insert a flip-flop (register) inside an arithmetic unit.

As a result, the array-type processor 20 according to this embodiment can dynamically reconfigure a pipeline circuit(s) in which the number of pipeline stages is increased by reducing the delay even further. That is, the array-type processor 20 according to this embodiment can dynamically reconfigure a pipeline circuit(s) having an even-higher throughput. Note that in this process, the behavioral synthesis unit 100 performs behavioral synthesis while setting an even-shorter delay (stricter delay constraint) for a loop description(s) that is converted into a pipeline(s).
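As a rough behavioral illustration of a selector that either uses or bypasses an internal flip-flop, the following sketch models an arithmetic unit in the style of the arithmetic unit 212 c (flip-flop behind the arithmetic element). The class name, the initial flip-flop value of 0, and the doubling operation are assumptions made only for this example; it is a cycle-level model, not a description of the actual circuit.

```python
class ArithmeticUnitWithOutputFlipFlop:
    """Cycle-level sketch: arithmetic element, then flip-flop, then selector."""

    def __init__(self, arithmetic_element, use_flip_flop):
        self.arithmetic_element = arithmetic_element
        self.use_flip_flop = use_flip_flop  # selector setting, fixed per state
        self.flip_flop = 0                  # assumed initial flip-flop content

    def output(self, data):
        # Selector: registered result from the previous clock, or the direct result.
        result = self.arithmetic_element(data)
        return self.flip_flop if self.use_flip_flop else result

    def clock_edge(self, data):
        # The flip-flop takes in the calculation result in synchronization with a clock.
        self.flip_flop = self.arithmetic_element(data)


def run(unit, inputs):
    """Apply one input per clock cycle and collect the unit's output in each cycle."""
    outputs = []
    for data in inputs:
        outputs.append(unit.output(data))
        unit.clock_edge(data)
    return outputs


def double(value):
    return 2 * value


bypassed = run(ArithmeticUnitWithOutputFlipFlop(double, use_flip_flop=False), [1, 2, 3])
registered = run(ArithmeticUnitWithOutputFlipFlop(double, use_flip_flop=True), [1, 2, 3])
```

With the flip-flop selected, each result appears one cycle later, which is the added pipeline stage; with the flip-flop bypassed, the unit behaves like the plain arithmetic unit 212.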

[Modified Example of Memory Unit]

Next, a modified example of the memory unit 210 provided in the array-type processor 20 is explained with reference to FIGS. 12A to 12C. FIG. 12A is a block diagram showing the memory unit 210. Further, FIGS. 12B and 12C are block diagrams showing modified examples of the memory unit 210 as memory units 210 b and 210 c.

A memory unit 210 shown in FIG. 12A includes a memory (MEM) 2101. In a write operation, data is written into a memory cell(s) specified by an address signal in the memory 2101. Further, in a read operation, data is read from a memory cell(s) specified by an address signal in the memory 2101.

The memory unit 210 b shown in FIG. 12B includes a flip-flop in front of a memory. Specifically, the memory unit 210 b includes a memory 2101 b, a flip-flop (F/F) 2102 b, and a selector (SEL) 2103 b. The flip-flop 2102 b takes in an externally-supplied address signal and externally-supplied write data (in the case of a write operation) in synchronization with a clock and outputs the taken address signal and the write data. The selector 2103 b selectively outputs either the output of the flip-flop 2102 b (address signal and write data) or the externally-supplied address signal and the externally-supplied write data (in the case of a write operation) according to the state. In the memory 2101 b, data writing or data reading is performed by using the output of the selector 2103 b.

The memory unit 210 c shown in FIG. 12C includes a flip-flop behind a memory. Specifically, the memory unit 210 c includes a memory 2101 c, a flip-flop 2102 c, and a selector 2103 c. In the memory 2101 c, externally-supplied write data is written into a memory cell(s) specified by an externally-supplied address signal in a write operation. Further, data is read from a memory cell(s) specified by an externally-supplied address signal in a read operation. In a read operation, the flip-flop 2102 c takes in data read from the memory 2101 c in synchronization with a clock and outputs the taken data. The selector 2103 c selectively outputs either the output of the flip-flop 2102 c or the data read from the memory 2101 c according to the state.

Note that the array-type processor 20 according to this embodiment includes one of the memory units 210 b and 210 c as a substitute for each of part or all of the plurality of memory units 210 that constitute the data memory unit 206. As a result, the array-type processor 20 according to this embodiment can not only insert a register 213 between arithmetic units and/or between an arithmetic unit and a memory unit, but also insert a flip-flop (register) inside a memory unit.

As a result, the array-type processor 20 according to this embodiment can dynamically reconfigure a pipeline circuit(s) in which the number of pipeline stages is increased by reducing the delay even further. That is, the array-type processor 20 according to this embodiment can dynamically reconfigure a pipeline circuit(s) having an even-higher throughput. Note that in this process, the behavioral synthesis unit 100 performs behavioral synthesis while setting an even-shorter delay (stricter delay constraint) for a loop description(s) that is converted into a pipeline(s).

Other Modified Examples

Next, other modified examples of the array-type processor 20 are explained with reference to FIGS. 13A and 13B. In this example, a plurality of register units 209 each of which includes a flip-flop and a selector are provided on a data line(s) in the matrix circuit unit 205. Similarly, a plurality of register units 209 are also provided on a flag line(s) in the matrix circuit unit 205.

FIG. 13A shows a configuration example of a plurality of register units 209. FIG. 13B shows a part of the array-type processor 20 that is dynamically reconfigured by using the register units 209.

As shown in FIG. 13A, a plurality of register units 209 each including a flip-flop and a selector are provided on a data line. The selector changes whether input data is output through the flip-flop or the flip-flop is bypassed according to the state. For example, it is possible to change the places on a data line(s) at which flip-flops are inserted as desired by bringing arbitrarily-selected register units 209 among the plurality of register units 209 into an enabled state.

In the example shown in FIG. 13B, the flip-flop of one of the plurality of register units 209 is brought into an enabled state and thereby inserted between the preceding register (REG1) 213 and the arithmetic unit 212. By doing so, the wiring delay between the preceding register 213 and the arithmetic unit 212 is reduced. For example, the flip-flop is inserted in such a position that the wiring delay between the preceding register 213 and the arithmetic unit 212 is roughly equal to the wiring delay between the arithmetic unit 212 and the subsequent register 213.
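One way to picture the choice of insertion position is the sketch below. It is only one reading of the balancing rule: the data line from the preceding register to the arithmetic unit is modeled as wiring segments separated by register units 209, and the register unit to enable is the one that leaves a remaining wiring delay up to the arithmetic unit closest to the wiring delay from the arithmetic unit to the subsequent register. The function name and all delay values are hypothetical.

```python
def choose_register_unit(segment_delays_ns, downstream_delay_ns):
    """Pick which register unit on the line REG1 -> arithmetic unit to enable.

    segment_delays_ns: wiring delays of the segments between REG1 and the
        arithmetic unit; register unit i sits between segment i - 1 and segment i.
    downstream_delay_ns: wiring delay from the arithmetic unit to the next register.

    Returns (index, remaining_delay_ns), where remaining_delay_ns is the wiring
    delay from the enabled flip-flop to the arithmetic unit.
    """
    if len(segment_delays_ns) < 2:
        raise ValueError("at least two segments are needed to have a register unit")
    best_index = None
    best_remaining = None
    for index in range(1, len(segment_delays_ns)):
        remaining = sum(segment_delays_ns[index:])
        if (best_remaining is None
                or abs(remaining - downstream_delay_ns)
                < abs(best_remaining - downstream_delay_ns)):
            best_index, best_remaining = index, remaining
    return best_index, best_remaining


# Illustrative values: four 1 ns segments before the arithmetic unit, 2 ns after it.
chosen_index, remaining_ns = choose_register_unit([1, 1, 1, 1], 2)
```

In this illustration the second register unit is enabled, so the 4 ns line is split and the 2 ns left in front of the arithmetic unit equals the 2 ns behind it.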

In this manner, it is possible to change the insertion places on a data line(s) at which flip-flops are inserted as desired in the array-type processor 20 according to this embodiment. As a result, the array-type processor 20 according to this embodiment can dynamically reconfigure a pipeline circuit(s) in which the number of pipeline stages is increased by reducing the delay even further. That is, the array-type processor 20 according to this embodiment can dynamically reconfigure a pipeline circuit(s) having an even-higher throughput. Further, it is also possible to optimize the overall delay of the circuit. Note that in this process, the data processing apparatus 10 determines the above-described flip-flop insertion places when placing/routing processing is performed in the object code generation unit 109.

Note that details of a configuration in which a plurality of register units 209 are provided on a data line are also disclosed in D. Singh, S. Brown, “The case for registered routing switches in field programmable gate arrays”, Proceedings ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, February 2001, pp. 161-169.

Although this example is explained by using an example case where the register unit 209 includes a flip-flop and a selector, the register unit is not limited to this configuration. The register unit 209 may include only a flip-flop.

Next, a behavioral synthesis flow for the array-type processor 20 according to this embodiment of the present invention is explained with reference to FIGS. 14 and 15. FIG. 14 is a flowchart showing a first operation of the behavioral synthesis unit 100 according to this embodiment. FIG. 15 is a flowchart showing a second operation of the behavioral synthesis unit 100 according to this embodiment.

[First Flowchart]

In the example shown in FIG. 14, the behavioral synthesis unit 100 reads circuit information 13A instead of the circuit information 13. The circuit information 13A includes circuit information for a pipeline circuit and circuit information for a multi-state circuit. In the circuit information for a pipeline circuit, information of resources having a relatively short delay (arithmetic units 212 b and 212 c, memory units 210 b and 210 c, register 213, and the like) among the resources provided in the array-type processor 20 is defined. Meanwhile, in the circuit information for a multi-state circuit, information of resources having a relatively long delay (arithmetic unit 212, memory unit 210, register 213, and the like) among the resources provided in the array-type processor 20 is defined.

This behavioral synthesis unit 100 performs scheduling and allocation by setting a delay constraint and circuit information for a pipeline circuit for a loop description(s) that is to be converted into a pipeline(s) and setting a delay constraint and circuit information for a multi-state circuit for the other description(s) (S106 and S107). In other words, the behavioral synthesis unit 100 performs scheduling and allocation by setting a shorter delay constraint and a resource(s) having a shorter delay for a loop description(s) that is converted into a pipeline(s) than those for the other description(s).
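The selection between the two sets of settings can be sketched as follows. The data structure, the function name, and the 7 ns and 12 ns values (borrowed from the earlier delay example) are assumptions for illustration only; the resource names follow the lists given above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SynthesisSetting:
    """Hypothetical pairing of a delay constraint with a set of usable resources."""
    delay_constraint_ns: float
    resources: tuple


# Illustrative values only; the actual constraints come from the synthesis constraint.
PIPELINE_SETTING = SynthesisSetting(
    delay_constraint_ns=7.0,
    resources=("arithmetic unit 212b", "arithmetic unit 212c",
               "memory unit 210b", "memory unit 210c", "register 213"),
)
MULTI_STATE_SETTING = SynthesisSetting(
    delay_constraint_ns=12.0,
    resources=("arithmetic unit 212", "memory unit 210", "register 213"),
)


def select_setting(is_loop, convert_to_pipeline):
    """Return the setting used for scheduling and allocation of one description."""
    if is_loop and convert_to_pipeline:
        return PIPELINE_SETTING
    return MULTI_STATE_SETTING


pipelined_loop = select_setting(is_loop=True, convert_to_pipeline=True)
other_description = select_setting(is_loop=True, convert_to_pipeline=False)
```

The only property the sketch relies on is that the setting chosen for a loop description converted into a pipeline carries the shorter delay constraint and the shorter-delay resources.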

The other operations of the behavioral synthesis unit 100 shown in FIG. 14 are similar to the operations shown in FIG. 5, and therefore their explanation is omitted.

[Second Flowchart]

In the example shown in FIG. 15, the behavioral synthesis unit 100 also performs the optimization at an RTL level and/or a logic level after going through similar operations to those shown in FIG. 5 (S111). After that, the behavioral synthesis unit 100 inserts additional flip-flops for the circuit section to be converted into pipelines (S1112), and then outputs the resultant circuit as an RTL code 14 (S112).

In the operation shown in FIG. 15, there is no need to prepare two types of circuit information pieces in contrast to the case shown in FIG. 14. However, in this case, it is necessary to pay attention so that no additional flip-flop is inserted for a pipeline circuit in which a data hazard could occur.

Fourth Embodiment

In this embodiment according to the present invention, placing/routing of a circuit in which a data hazard occurs due to a conversion of a loop description into a pipeline is explained.

As already explained above with reference to FIG. 4 and the like, a data hazard occurs when the order of a data write process and a data read process or another data write process is reversed. Therefore, a data hazard tends to occur in a circuit description in which a variable is referred to by the variable itself. Specifically, a data hazard tends to occur in a loop counter circuit and the like in which a variable is referred to by the variable itself.

FIG. 16A shows a source code of a loop counter circuit 300 and FIG. 16B shows its logic configuration.

As shown in FIG. 16B, the loop counter circuit 300 includes a selector (SEL) 301, an addition circuit 302, a comparison circuit 303, and registers 304 to 306. The registers 304 to 306 store a value 1, a value x (arbitrary natural number), and the maximum value of x (max) respectively.

The addition circuit 302 adds the value 1 and the value x (initial value 0), and outputs the addition result “1”. The selector 301 selects and outputs the addition result “1” of the addition circuit 302 during the loop processing. The register 305 takes in the output “1” of the selector 301 in synchronization with a clock and outputs the taken output “1”. As a result, the addition circuit 302 adds the value 1 and the value x (value 1), and outputs the addition result “2”. The operation like this is repeated. Then, when a relation “x>max” is satisfied, the comparison circuit 303 changes its output value from the initial value to a different value. As a result, the loop processing is finished. Note that when the loop processing is not being performed, the selector 301 supplies the output of the register 305 directly to the input of the register 305.
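The behavior described above can be traced with a small cycle-level sketch. It is not the source code of FIG. 16A, which is not reproduced here; the function name and the step-by-step model are assumptions that follow only the textual description (x starts at 0, 1 is added per clock, and the loop ends when x > max).

```python
def run_loop_counter(max_value):
    """Return the values latched into register 305 on successive clock edges."""
    one = 1            # register 304 holds the value 1
    x = 0              # register 305 holds x (initial value 0)
    latched_values = []
    while True:
        loop_finished = x > max_value   # comparison circuit 303
        if loop_finished:
            break
        addition_result = one + x       # addition circuit 302
        selected = addition_result      # selector 301 picks the sum during the loop
        x = selected                    # register 305 takes it in on the clock
        latched_values.append(x)
    # Outside the loop, the selector feeds the register's own output back to it,
    # so x simply keeps its value.
    return latched_values


trace = run_loop_counter(3)
```

Each new value of x depends on the value written in the immediately preceding cycle, which is the self-reference that makes this kind of description prone to a data hazard when it is pipelined.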

Note that when forwarding processing is carried out for the loop description of the loop counter circuit 300 in the scheduling of the behavioral synthesis, the writing and the reading of the register 305 are scheduled within the number of states to be folded. Note that for the sake of an easier explanation, the following example is explained by using an example case where the write processing and the read processing of the register 305 are scheduled in the same state. Therefore, it is impossible to increase the number of pipeline stages by inserting a flip-flop (register) in front of or behind the addition circuit 302 or the selector 301 (however, it is possible to insert a flip-flop (register) in front of or behind the comparison circuit 303). That is, this loop description is behavior-synthesized as a combinational circuit that operates within one execution cycle.

Therefore, when forwarding processing is carried out for a loop description in which a data hazard could occur, the data processing apparatus 10 according to this embodiment sets a flag to a group of logic circuits generated based on that loop description (in the example shown in FIG. 16B, selector 301, addition circuit 302, register 305, and the like). More specifically, the group of logic circuits are configured so that each of those logic circuits outputs an identifiable signal having a predetermined level. Further, the data processing apparatus 10 places the group of logic circuits to which the flag is set close to each other so that the wiring delays are reduced as much as possible when the placing/routing processing is performed in the object code generation unit 109. By doing so, the data processing apparatus 10 can reduce the processing time of the pipeline circuit on which the forwarding processing has been carried out.

FIG. 17 shows a placement example of a part of the loop counter circuit 300. As shown in FIG. 17, for example, the loop counter circuit 300 is dynamically reconfigured by using mutually-adjacent processor elements 207.
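A toy placement sketch can illustrate the idea of keeping the flagged group on mutually-adjacent processor elements. The serpentine ordering, the grid size, and the cell names other than the flagged ones are assumptions for this example; it is not the placing/routing algorithm of the object code generation unit 109.

```python
def serpentine_positions(rows, cols):
    """Grid positions in an order where consecutive positions are always adjacent."""
    positions = []
    for row in range(rows):
        columns = range(cols) if row % 2 == 0 else range(cols - 1, -1, -1)
        positions.extend((row, col) for col in columns)
    return positions


def place(cells, flagged, rows, cols):
    """Assign each cell a grid position, placing the flagged group first."""
    if len(cells) > rows * cols:
        raise ValueError("not enough processor elements for all cells")
    ordered = ([cell for cell in cells if cell in flagged]
               + [cell for cell in cells if cell not in flagged])
    return dict(zip(ordered, serpentine_positions(rows, cols)))


def manhattan(a, b):
    """Grid distance between two positions, used here as a stand-in for wiring delay."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


# Illustrative netlist: three flagged cells of the loop counter plus two other cells.
cells = ["multiplier", "selector 301", "comparison circuit 303",
         "addition circuit 302", "register 305"]
flagged = {"selector 301", "addition circuit 302", "register 305"}
placement = place(cells, flagged, rows=2, cols=3)
```

Because the flagged cells occupy consecutive serpentine positions, each one sits next to the following one, so the wiring inside the feedback loop stays as short as this simple grid model allows.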

Note that in the array-type processor 20, the placing/routing processing is performed based on relatively large circuit units such as a look-up table and a processor element (PE) in comparison to gate-array LSIs (Large Scale Integrations), cell-based LSIs, and the like. Therefore, performing behavioral synthesis with consideration given to the mutually-adjacent placement is effective for improving the throughput.

Fifth Embodiment

In the array-type processor 20 according to the above-described first to fourth embodiments, a pipeline circuit(s) operates in synchronization with a clock having a higher frequency than that for the other circuit(s) (multi-state circuit(s)). That is, the pipeline circuit and the multi-state circuit operate in synchronization with clocks having mutually-different frequencies. An array-type processor 20 according to this embodiment of the present invention also dynamically changes, when the circuit delay changes according to the state, the frequency of a clock according to the maximum delay (critical path) of the circuit in each state.

Note that a method for changing a circuit delay according to the state is disclosed, for example, in Japanese Patent No. 4753895.

Meanwhile, as an example of a method for dynamically changing the frequency of a clock, there is a method in which one of a plurality of clock supply lines is selected according to the state and the clock of the selected clock supply line is supplied to a corresponding circuit(s). However, in this method, the number of clock supply lines increases and thus the circuit is crowded with the lines. Therefore, the number of types of clock frequencies cannot be increased so much. Further, this method requires additional switches for switching the clock supply line. Therefore, as another example of a method for dynamically changing the frequency of a clock, there is a method in which a clock supply source generates a clock having a frequency that is determined according to the state and the generated clock is supplied through one clock supply line. For example, International Patent Publication No. WO2009/116398 discloses this method.

As described above, the array-type processor 20 according to this embodiment of the present invention can dynamically change, when the circuit delay changes according to the state, the frequency of a clock according to the maximum delay (critical path) of the circuit in each state regardless of whether the circuit is a pipeline circuit, a multi-state circuit, or a pipeline circuit having a plurality of states.
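The relation between the critical path of a state and the clock frequency that state can run at is simple arithmetic, sketched below. The state names and path delays are illustrative assumptions (the 7 ns and 12 ns values reuse the earlier delay example), and setup and hold times are ignored as in that example.

```python
def clock_frequency_mhz(path_delays_ns):
    """Highest clock frequency allowed by the longest path delay in a state."""
    critical_path_ns = max(path_delays_ns)
    # A period of T ns corresponds to a frequency of 1000 / T MHz.
    return 1000.0 / critical_path_ns


# Illustrative register-to-register path delays for two states.
state_paths_ns = {
    "pipeline state": [7.0, 5.0],
    "multi-state state": [12.0, 9.0],
}
frequencies_mhz = {
    state: clock_frequency_mhz(paths) for state, paths in state_paths_ns.items()
}
```

With these values the pipeline state can be clocked at roughly 143 MHz and the multi-state state at roughly 83 MHz, which is the kind of per-state difference the dynamic frequency change is meant to exploit.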

As described above, the behavioral synthesis unit (behavioral synthesis apparatus) 100 according to the above-described embodiments of the present invention performs behavioral synthesis while setting a shorter delay (stricter delay constraint) for a loop description that is converted into a pipeline than a loop description that is not converted into a pipeline. As a result, although the number of pipeline stages increases and thus the latency increases in the pipeline circuit, the increase in the number of execution cycles is reduced and the processing time per step is also reduced owing to the conversion into the pipeline. Therefore, the throughput improves. Further, the number of states is reduced and thus the number of execution cycles is reduced in the multi-state circuit other than the pipeline circuit. In addition, the total time of the setup time and the hold time of a register, a memory, or the like is also reduced. Therefore, the throughput improves. That is, the behavioral synthesis unit 100 according to the above-described embodiments can improve the overall throughput of the circuit in comparison to the related art.
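The trade-off between latency and throughput can be made concrete with a rough calculation. The sketch assumes the textbook pipeline model in which a new loop iteration starts every cycle (initiation interval of 1), and it reuses the illustrative 12 ns and 7 ns cycle periods from the earlier delay example; the function names and the iteration count are hypothetical.

```python
def multi_state_time_ns(iterations, states_per_iteration, period_ns):
    """Total time when every iteration runs through all of its states in turn."""
    return iterations * states_per_iteration * period_ns


def pipeline_time_ns(iterations, stages, period_ns, initiation_interval=1):
    """Total time when iterations overlap in a pipeline.

    The first iteration needs `stages` cycles (the latency); each further
    iteration finishes `initiation_interval` cycles after the previous one.
    """
    cycles = stages + (iterations - 1) * initiation_interval
    return cycles * period_ns


# Illustrative comparison for 100 iterations of a loop with two chained operations:
# one 12 ns state per iteration without pipelining, two 7 ns stages with pipelining.
not_pipelined_ns = multi_state_time_ns(iterations=100, states_per_iteration=1, period_ns=12)
pipelined_ns = pipeline_time_ns(iterations=100, stages=2, period_ns=7)
```

Under these assumptions the latency of a single iteration grows from 12 ns to 14 ns, but the total time for 100 iterations drops from 1200 ns to 707 ns, because the extra stage is hidden by the overlap and every cycle is shorter.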

Further, the array-type processor (parallel processing device) 20 according to the above-described embodiments includes, for example, an arithmetic unit including a flip-flop inside thereof, a memory unit, and a register unit. As a result, the array-type processor 20 according to the above-described embodiments can dynamically reconfigure a pipeline circuit(s) in which the number of pipeline stages is increased by reducing the delay even further. That is, the array-type processor 20 according to the above-described embodiments can dynamically reconfigure a pipeline circuit(s) having an even-higher throughput.

Further, the data processing apparatus 10 according to the above-described embodiments sets, when forwarding processing is carried out for a loop description in which a data hazard could occur, a flag to a group of logic circuits generated based on that loop description. Further, the data processing apparatus 10 according to the above-described embodiments places the group of logic circuits to which the flag is set close to each other so that the wiring delays are reduced as much as possible when the placing/routing processing is performed. By doing so, the data processing apparatus 10 according to the above-described embodiments can reduce the processing time of the circuit on which the forwarding processing has been carried out. That is, it is possible to improve the throughput.

Further, the array-type processor 20 according to the above-described embodiments of the present invention can dynamically change, when the circuit delay changes according to the state, the frequency of a clock according to the maximum delay (critical path) of the circuit in each state regardless of whether the circuit is a pipeline circuit, a multi-state circuit, or a pipeline circuit having a plurality of states.

Further, in the behavioral synthesis unit and the data processing apparatus including the same according to the above-described embodiments of the present invention, arbitrary processing can be also implemented by causing a CPU (Central Processing Unit) to execute a computer program.

In the above-described examples, the program can be stored and provided to a computer using any type of non-transitory computer readable media. Non-transitory computer readable media include any type of tangible storage media. Examples of non-transitory computer readable media include magnetic storage media (such as floppy disks, magnetic tapes, hard disk drives, etc.), optical magnetic storage media (e.g. magneto-optical disks), CD-ROM (compact disc read only memory), CD-R (compact disc recordable), CD-R/W (compact disc rewritable), DVD (Digital Versatile Disc), BD (Blu-ray (registered trademark) Disc), and semiconductor memories (such as mask ROM, PROM (programmable ROM), EPROM (erasable PROM), flash ROM, RAM (random access memory), etc.). The program may be provided to a computer using any type of transitory computer readable media. Examples of transitory computer readable media include electric signals, optical signals, and electromagnetic waves. Transitory computer readable media can provide the program to a computer via a wired communication line (e.g. electric wires, and optical fibers) or a wireless communication line.

The present invention made by the inventors of the present applicationhas been explained above in a concrete manner based on embodiments.However, the present invention is not limited to the above-describedembodiments, and needless to say, various modifications can be madewithout departing from the spirit and scope of the present invention.

The first to fifth embodiments can be combined as desired by one of ordinary skill in the art.

While the invention has been described in terms of several embodiments, those skilled in the art will recognize that the invention can be practiced with various modifications within the spirit and scope of the appended claims and the invention is not limited to the examples described above.

Further, the scope of the claims is not limited by the embodiments described above.

Furthermore, it is noted that Applicant's intent is to encompass equivalents of all claim elements, even if amended later during prosecution.

What is claimed is:
 1. A behavioral synthesis apparatus comprising: at least one memory operable to store program instruction; at least one processor operable to read said program instruction and configured by the program instruction to: determine whether or not a loop description should be converted into a pipeline; read constraint information for performing behavioral synthesis including a first delay constraint for a pipeline circuit and a second delay constraint for a circuit other than the pipeline circuit, the first delay constraint being stricter than the second delay constraint; and perform behavioral synthesis based on the read constraint information by setting a stricter delay constraint for the loop description that is converted into the pipeline than the loop description that is not converted into the pipeline.
 2. The behavioral synthesis apparatus according to claim 1, wherein when forwarding processing is carried out for the loop description to be converted into the pipeline, the at least one processor is further configured to set a flag to a group of logic circuits generated based on that loop description.
 3. A data processing apparatus comprising: a behavioral synthesis apparatus according to claim 1; and a layout processor configured to synthesize a logic circuit from a structural code output from the behavioral synthesis apparatus and then performs placing/routing.
 4. A data processing apparatus comprising: a behavioral synthesis apparatus according to claim 2; and a layout processor configured to synthesize a logic circuit from a structural code output from the behavioral synthesis apparatus and then performs placing/routing, wherein the layout processor places the group of logic circuits to which the flag is set close to each other.
 5. A data processing system comprising: a data processing apparatus according to claim 3; and a parallel arithmetic apparatus in which a circuit is dynamically configured according to a state based on a netlist output from the data processing apparatus.
 6. A data processing system comprising: a data processing apparatus according to claim 4; and a parallel arithmetic apparatus in which a circuit is dynamically configured according to a state based on a netlist output from the data processing apparatus.
 7. The data processing system according to claim 5, wherein the netlist includes a finite state machine and a plurality of data paths each corresponding to a respective one of a plurality of states included in the finite state machine, and the parallel arithmetic apparatus comprises: a plurality of processor elements configured to: select a context according to a state of the finite state machine from among a plurality of contexts each corresponding to a respective one of the plurality of data paths; determine an arithmetic processing content based on the selected context; and a plurality of switch elements each of which determines a connection relation of a respective one of the plurality of processor elements based on the selected context.
 8. The data processing system according to claim 7, wherein each of the plurality of processor elements comprises: an instruction memory that stores a plurality of operation instructions each corresponding to a respective one of the plurality of contexts, the instruction memory being configured so that an operation instruction corresponding to the context selected by the state transition controlling unit among the plurality of operation instructions is read from the instruction memory; an arithmetic unit that performs arithmetic processing on input data in accordance with an arithmetic processing content according to the operation instruction; and a register that temporarily stores at least one of the input data, a calculation result by the arithmetic unit, and intermediate data of arithmetic processing performed by the arithmetic unit.
 9. The data processing system according to claim 8, wherein at least one of the arithmetic unit comprises: a register that temporarily stores the input data, the input data being supplied from outside of the arithmetic unit; a selector that selectively outputs either the input data that is supplied from outside of the arithmetic unit or the input data stored in the register according to the operation instruction; and an arithmetic element that performs arithmetic processing on data output from the selector in accordance with an arithmetic processing content according to the operation instruction.
 10. The data processing system according to claim 8, wherein at least one of the arithmetic unit comprises: an arithmetic element that performs arithmetic processing on the input data in accordance with an arithmetic processing content according to the operation instruction; a register that temporarily stores a calculation result by the arithmetic element; and a selector that selectively outputs either the calculation result of the arithmetic element or the calculation result stored in the register according to the operation instruction.
 11. The data processing system according to claim 8, wherein at least one of the arithmetic unit comprises: a first arithmetic element that performs arithmetic processing on the input data in accordance with an arithmetic processing content according to the operation instruction and thereby generate intermediate data; a register that temporarily stores the intermediate data; a selector that selectively outputs either the intermediate data output from the first arithmetic element or the intermediate data stored in the register according to the operation instruction; and a second arithmetic element that performs arithmetic processing on data output from the selector in accordance with an arithmetic processing content according to the operation instruction.
 12. The data processing system according to claim 7, wherein the parallel arithmetic apparatus further comprises a plurality of memory units that store output results of the plurality of processor elements, and at least one of the plurality of memory units comprises: a register that temporarily stores an address signal and data, the address signal being supplied from outside of the memory unit, and the data being supplied from outside of the memory unit in a write operation; a selector that selectively outputs either the address signal and the data that are supplied from outside of the memory unit or the address signal and the data stored in the register according to a state; and a memory in which data output from the selector is written into a memory cell specified by an address signal output from the selector or data stored in a memory cell specified by an address signal output from the selector is read.
 13. The data processing system according to claim 7, wherein the parallel arithmetic apparatus further comprises a plurality of memory units that store output results of the plurality of processor elements, and at least one of the plurality of memory units comprises: a memory in which data is written into a memory cell specified by an address signal or data stored in a memory cell specified by an address signal is read; a register that temporarily stores data read from the memory in a read operation; and a selector that selectively outputs either the data read from the memory or the data stored in the register according to a state.
 14. The data processing system according to claim 7, wherein the parallel arithmetic apparatus further comprises a plurality of register units provided on a data line connecting the plurality of processor elements, and each of the plurality of register units comprises: a register that temporarily stores input data, the input data being supplied from outside of the register unit; and a selector that selectively outputs either the input data that is supplied from outside of the register unit or the input data stored in the register according to a state.
 15. The data processing system according to claim 5, wherein the parallel arithmetic apparatus generates, in each state, a clock having a frequency according to a maximum delay of a circuit to be configured and supplies the generated clock to that circuit.
 16. The data processing system according to claim 5, wherein the layout processor synthesizes a logic circuit by using a circuit resource provided in the parallel arithmetic apparatus and performs placing/routing.
 17. A behavioral synthesis method of performing behavioral synthesis comprising: determining whether or not a loop description should be converted into a pipeline; reading constraint information for performing behavioral synthesis including a first delay constraint for a pipeline circuit and a second delay constraint for a circuit other than the pipeline circuit, the first delay constraint being stricter than the second delay constraint; and performing behavioral synthesis based on the read constraint information by setting a stricter delay constraint for the loop description that is converted into the pipeline than the loop description that is not converted into the pipeline.
 18. The behavioral synthesis method according to claim 17, further comprising: setting, when forwarding processing is carried out for the loop description to be converted into a pipeline, a flag to a group of logic circuits generated based on that loop description; and performing the behavioral synthesis after setting the flag.
 19. A non-transitory computer readable medium storing a behavioral synthesis program that causes a computer to execute a method comprising: determining whether or not a loop description should be converted into a pipeline; reading constraint information for performing behavioral synthesis including a first delay constraint for a pipeline circuit and a second delay constraint for a circuit other than the pipeline circuit, the first delay constraint being stricter than the second delay constraint; and performing behavioral synthesis based on the read constraint information by setting a stricter delay constraint for the loop description that is converted into the pipeline than the loop description that is not converted into the pipeline.
 20. The non-transitory computer readable medium storing a behavioral synthesis program according to claim 19, wherein the program further causes a computer to execute a flag setting process of setting, when forwarding processing is carried out for the loop description to be converted into a pipeline, a flag to a group of logic circuits generated based on that loop description, and the behavioral synthesis is performed after the flag setting process.